{"id":15657830,"url":"https://github.com/pythonicninja/agregacje3_exam","last_synced_at":"2026-03-15T00:41:40.313Z","repository":{"id":141913801,"uuid":"28720037","full_name":"PythonicNinja/agregacje3_exam","owner":"PythonicNinja","description":null,"archived":false,"fork":false,"pushed_at":"2015-01-07T17:33:30.000Z","size":520,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-05T04:46:15.835Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PythonicNinja.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-01-02T17:13:25.000Z","updated_at":"2015-01-07T17:33:30.000Z","dependencies_parsed_at":"2023-03-13T10:27:41.741Z","dependency_job_id":null,"html_url":"https://github.com/PythonicNinja/agregacje3_exam","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonicNinja%2Fagregacje3_exam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonicNinja%2Fagregacje3_exam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonicNinja%2Fagregacje3_exam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PythonicNinja%2Fagregacje3_exam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PythonicNinja","download_url":"https://codeload.github.com/PythonicNinja/agregacje3_exam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"
https://github.com","kind":"github","repositories_count":246266592,"owners_count":20749819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T13:09:55.670Z","updated_at":"2025-12-17T23:36:50.582Z","avatar_url":"https://github.com/PythonicNinja.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n**Table of Contents**  *generated with [DocToc](http://doctoc.herokuapp.com/)*\n\n- [Wojciech Nowak](#wojciech-nowak)\n      - [Hardware](#hardware)\n      - [Software versions](#software-versions)\n- [Assignment 3](#assignment-3)\n  - [Prepare map and reduce functions that:](#prepare-map-and-reduce-functions-that)\n    - [find all anagrams in the file word_list.txt](#find-all-anagrams-in-the-file-word_listtxt)\n      - [MongoDB](#mongodb)\n        - [Preparation](#preparation)\n        - [Import](#import)\n        - [Map reduce anagram list](#map-reduce-anagram-list)\n    - [find the most frequent words in Wikipedia data PL, the current articles file, approx. 
1.3 GB](#find-the-most-frequent-words-in-wikipedia-data-pl-the-current-articles-file-approx-13-gb)\n      - [import](#import)\n      - [Map Reduce optimized for multiple threads](#map-reduce-optimized-for-multiple-threads)\n        - [ScopedThread](#scopedthread)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n# Wojciech Nowak \n* student ID 206354, Computer Science, 1st year of the Master's programme\n\n#### Hardware\n* processor: \n\n\t\tModel:                  Intel Core i7-4650U\n\t\tProcessor speed:        1.7 GHz\n\t\tTotal number of cores:  2\n\t\tL2 cache (per core):    256 KB\n\t\tL3 cache:               4 MB\n\t\t\n* RAM:\n\n\t\tSize:   8 GB\n\t\tType:   DDR3\n\t\tSpeed:  1600 MHz\n\n* disk: \n\n\t\tmodel:  SSD Samsung sm0256f \n\t\tbus:    PCIe SSD\n\t\t\n#### Software versions\n* MongoDB shell version: \t\n        2.8.0-rc4\n* OS: Mac OS X \t\t\t\t10.10 (14B25)\n* Python:\n       2.7.8\n       \n# Assignment 3\n\n## Prepare map and reduce functions that:\n\n### find all anagrams in the file word_list.txt\n\n#### MongoDB\n\nImport command:\n\n\t\u003e\u003e time mongoimport --type csv -c Words --file anagram/word_list.txt -f word\n\t2015-01-02T19:11:38.073+0100    connected to: localhost\n    2015-01-02T19:11:38.271+0100    imported 8199 documents\n    \n    real    0m0.224s\n    user    0m0.059s\n    sys     0m0.016s\n\nA preliminary map reduce command without ScopedThread, which we will use later for Wikipedia.\n\n```js\ndb.Words.mapReduce(\n   function() {\n       var letters = this.word.split('');\n       letters = letters.sort();\n       emit(letters, this.word);\n   },  //map function\n   function(key,values) {\n        var all=\"\";\n        for(var i in values) {\n            all+=values[i]+\",\";\n        }\n        return all;\n   },\n   {\n      out: \"words_analysis\"\n   }\n);\n```\n\nI concluded that, using the 
Meteor framework and an easily available larger database of Polish words\n[such as http://sjp.pl/slownik/odmiany/](http://sjp.pl/slownik/odmiany/),\nI could build an application that searches for anagrams in real time. I first downloaded the database, but it requires some preparation:\n\n##### Preparation\n    \n1. Converting the encoding from win-1250 to UTF-8:\n\n\n    \u003etime iconv -f CP1250 -t utf-8 odm.txt \u003e odm_utf-8.txt\n    \n    real\t0m0.988s\n    user\t0m0.860s\n    sys\t0m0.111s\n    \n\n2. Removing the \",\" separators and replacing them with newline characters:\n\n\n    \u003e time cat odm_utf-8.txt | tr -d \" \"  | tr \",\" \"\\n\" \u003e odm.csv\n    \n    real\t0m7.668s\n    user\t0m14.790s\n    sys\t0m0.229s\n\n\n3. Number of words in the database:\n    \n    \n    \u003e cat odm.csv | wc -l \n    3 827 632\n\n\n##### Import\n\n1. Import into the MongoDB instance of Meteor JS:\n\n    \u003e\u003e time mongoimport -h localhost:3001 -d meteor -c words -f word --type csv --file odm.csv \n    2015-01-06T01:32:21.502+0100\timported 3827632 documents\n\n    real\t1m48.404s\n    user\t0m30.262s\n    sys\t    0m2.449s\n    \n##### Map reduce anagram list\n\n```js\ndb.words.mapReduce(\n   function() {\n       var letters = this.word.split('');\n       letters = letters.sort();\n       emit(letters.toString(), this.word);\n   },  //map function\n   function(key,values) {\n        return values.toString();\n   },\n   {\n      out: \"anagrams\"\n   }\n);\n```\nTime: about 7 min\n\n![single_thread_map_reduce](anagram/images/single_thread_map_reduce.png)\n\n\n### find the most frequent words in Wikipedia data PL, the current articles file, approx. 1.3 GB\n\n#### import\n\n1. Download the XML (38mb, couple minutes)\n\n\n    \u003e wget http://dumps.wikimedia.org/plwiki/20141228/plwiki-20141228-pages-articles-multistream.xml.bz2\n\n\n2. unzip it (180mb, couple seconds)\n\n\n    \u003e bunzip2 ./plwiki-20141228-pages-articles-multistream.xml.bz2\n\n\n3. 
load it into mongo\n\n\n    \u003e time node index.js plwiki-20141228-pages-articles-multistream.xml\n    \n    =================done========\n    real    288m21.964s\n    user    257m31.175s\n    sys     6m15.147s\n\n    \n#### Map Reduce optimized for multiple threads\n\n##### ScopedThread\n    Unfortunately, ScopedThread was removed in version 2.6 because it was only used for writing tests\n    https://jira.mongodb.org/browse/SERVER-13485\n\n    A workaround for this problem is to load\n\n    \u003e\u003e curl -O https://raw.githubusercontent.com/mongodb/mongo/master/jstests/libs/parallelTester.js\n\n1. Loading ScopedThread\n\n\n        // curl -O https://raw.githubusercontent.com/mongodb/mongo/master/jstests/libs/parallelTester.js\n    \n    \n        load(\"parallelTester.js\");\n    \n    \n2. Splitting into chunks:\n    \n\n    var res = db.runCommand({splitVector: \"pl_wikipedia.wikipedia\", keyPattern: {_id: 1}, maxChunkSizeBytes: 4 * 1024 *1024 * 1024 });\n    \n    \n3. 
Function run in the threads:\n\n\n    var mapred = function(id, page) {\n        return db.runCommand({\n                mapreduce: \"wikipedia\",\n                map: function () {\n                    if(this.hasOwnProperty('text')) {\n                        var words = [];\n                        for (var name_text in this.text) {\n                            for (var i = 0; i \u003c this.text[name_text].length; i++) {\n                                words = words.concat(this.text[name_text][i].text.split(' '));\n                            }\n                        }\n                        for (var i = 0; i \u003c words.length; i++)\n                            emit(words[i], 1);\n                    }\n                },\n                reduce: function (key, values) { return Array.sum(values); },\n                out: { replace: \"mrout\" + id, db: \"mrdb\" + id },\n                sort: {_id: -1},\n                query: { _id: { $lt: id} },\n                limit: page\n            })\n    };\n    \n4. Running it\n\n\n        \u003e\u003e load(\"map_reduce.js\");\n        2015-01-05T14:56:28.966+0100 I CONTROL  [initandlisten]\n        id:54aca1cc2c9d2a4b4ea38e9f\n        id:54acad102c9d2a4b4ea75f30\n        id:54acb9692c9d2a4b4eab2fc1\n        id:54acc5b22c9d2a4b4eaf0053\n        id:54accfdf2c9d2a4b4eb2d0e4\n        connecting to: pl_wikipedia\n        connecting to: pl_wikipedia\n        connecting to: pl_wikipedia\n        connecting to: pl_wikipedia\n        connecting to: pl_wikipedia\n    \n    \n5. 
Analysis:\n\n        Number of emits\n\n![paraller_emits](wikipedia/images/paraller_emits.png)\n\n\n        All logical and physical cores are indeed in use.\n\n![paraller_performance](wikipedia/images/paraller_performance.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonicninja%2Fagregacje3_exam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythonicninja%2Fagregacje3_exam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonicninja%2Fagregacje3_exam/lists"}