{"id":23465525,"url":"https://github.com/agentds/inf-553","last_synced_at":"2026-02-26T07:47:39.109Z","repository":{"id":165020381,"uuid":"204526073","full_name":"AgentDS/INF-553","owner":"AgentDS","description":"INF 553 2019 Fall Semester, USC Viterbi School. The code for homework is not allowed to published, so only tips for homework were published. Hope these will help you.","archived":false,"fork":false,"pushed_at":"2020-02-02T07:54:29.000Z","size":269497,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-27T22:04:52.882Z","etag":null,"topics":["data-mining","inf553","usc"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AgentDS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-26T17:23:09.000Z","updated_at":"2020-12-19T07:27:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"3264c93f-f319-4504-bfdf-9db6b5feaf81","html_url":"https://github.com/AgentDS/INF-553","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AgentDS/INF-553","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgentDS%2FINF-553","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgentDS%2FINF-553/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgentDS%2FINF-553/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgentDS%2FINF-553/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AgentDS","download_url":"https://codeload.github.com/AgentDS/INF-553/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgentDS%2FINF-553/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274434044,"owners_count":25284429,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-10T02:00:12.551Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","inf553","usc"],"created_at":"2024-12-24T11:29:43.351Z","updated_at":"2026-02-26T07:47:34.089Z","avatar_url":"https://github.com/AgentDS.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# INF-553\n\nINF-553 2019 fall, by Prof. Anna Farzindar\n\n\u003e __The code for homework is not allowed to be published, so only tips for homework were published. 
Hope these will help you.__\n\n[TOC]\n\n## Basic Course Information\n\n- Lec1: Introduction \u0026 Large-Scale File System \u0026 MapReduce1\n- Lec2: MapReduce2 \u0026 3\n- Lec3: Find Frequent Itemsets 1\n- Lec4: Frequent Items 2 \u0026 3\n- Lec5: Find Similar Sets 1 \u0026 2\n- Lec6: Find Similar Sets 3\n- Lec7: Recommender System 1 \u0026 2\n- Lec8: Recommender System 3 \u0026 4\n- Lec9: Social Networks 1\n- Lec10: Social Networks 2 \u0026 Clustering\n- Lec11: Link Analysis\n- Lec12: Mining Data Streams\n\n\n\n\n\n\n\n## Homework\n\n__Environment:__ macOS Mojave 10.14.5\n\n__Requirements:__ Python 3.6, Scala 2.11 and Spark 2.3.3\n\n\u003e Apart from the requirements above, you can only use standard Python libraries\n\n\n\n### Install\n\nBefore installing, make sure you have [Anaconda](https://www.anaconda.com/distribution/) as well as [Homebrew](https://brew.sh/).\n\n#### Install Java JDK\n\nDownload the Java JDK from the [link](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html), and open the ``dmg`` file to install it.\n\nThen set ``JAVA_HOME`` in the ``~/.zshrc`` file (if using bash, add this to the ``~/.bash_profile`` file instead):\n\n```bash\nexport JAVA_HOME=\"/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home\"\n```\n\nSave the changes and apply them in the terminal:\n\n```bash\nsource ~/.zshrc\n```\n\nor for bash:\n\n```bash\nsource ~/.bash_profile\n```\n\nSome other installation tutorials suggest setting ``JAVA_HOME`` to ``usr/lib/jvm/xxx.jdk``, but this didn't work on my system.\n\nNow check in the terminal whether the Java JDK was installed successfully:\n\n```bash\njava -version\n```\n\nMine is: ![javaversion](./Note/pic/javaversion.png)\n\n\n\n\u003e I first tried to install Java via ``brew``:\n\u003e\n\u003e ```bash\n\u003e brew cask install java\n\u003e ```\n\u003e\n\u003e which installs Java 12. 
__However__, there seemed to be a problem with Java 12 and Spark 2.3.3, so I ended up using the Java 8 JDK ([this problem was mentioned here](https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec)).\n\n\n\n#### Install Spark\n\nOpen the [download link](https://archive.apache.org/dist/spark/spark-2.3.3/) and choose the version you want. Here I chose ``spark-2.3.3-bin-hadoop2.7.tgz``.\n\nWhen the download finishes, extract it and move it to your ``/opt`` folder:\n\n```bash\ntar -xzf spark-2.3.3-bin-hadoop2.7.tgz\nmv spark-2.3.3-bin-hadoop2.7 /opt/spark-2.3.3\n```\n\nCreate a symbolic link (useful if you have multiple Spark versions):\n\n```bash\nsudo ln -s /opt/spark-2.3.3 /opt/spark\n```\n\nThen add the Spark path to the ``~/.zshrc`` or ``~/.bash_profile`` file and apply the change:\n\n```bash\nexport SPARK_HOME=\"/opt/spark\"\nexport PATH=\"$SPARK_HOME/bin:$PATH\"\n```\n\nNow run the example to check whether the installation succeeded:\n\n```bash\ncd /opt/spark/bin\n# use grep to get clean output result\nrun-example SparkPi 2\u003e\u00261 | grep \"Pi is\"\n```\n\nMy result: ![sparkexample](./Note/pic/sparkexample.png)\n\n\n\n#### Install Pyspark in conda environment\n\nCreate a new conda environment:\n\n```bash\nconda create -n inf553 python=3.6\n```\n\nand activate the environment when it is ready:\n\n```bash\nconda activate inf553\n```\n\nNow install ``PySpark`` using ``pip``:\n\n```bash\npip install pyspark==2.3.3\n```\n\n\u003e I tried installing ``pyspark==2.4.4``, but this causes incompatibility with the Spark JVM libraries, since Spark 2.3.3 is installed!!! 
If you use ``pyspark==2.4.4``, running the test file shown [here](#testfile) will end with this error:\n\u003e\n\u003e ```python\n\u003e Traceback (most recent call last):\n\u003e   File \"test.py\", line 5, in \u003cmodule\u003e\n\u003e     numAs = logData.filter(lambda line: 'a' in line).count()\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py\", line 403, in filter\n\u003e     return self.mapPartitions(func, True)\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py\", line 353, in mapPartitions\n\u003e     return self.mapPartitionsWithIndex(func, preservesPartitioning)\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py\", line 365, in mapPartitionsWithIndex\n\u003e     return PipelinedRDD(self, f, preservesPartitioning)\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py\", line 2514, in __init__\n\u003e     self.is_barrier = prev._is_barrier() or isFromBarrier\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/pyspark/rdd.py\", line 2414, in _is_barrier\n\u003e     return self._jrdd.rdd().isBarrier()\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/py4j/java_gateway.py\", line 1257, in __call__\n\u003e     answer, self.gateway_client, self.target_id, self.name)\n\u003e   File \"//anaconda3/envs/inf553/lib/python3.6/site-packages/py4j/protocol.py\", line 332, in get_return_value\n\u003e     format(target_id, \".\", name, value))\n\u003e py4j.protocol.Py4JError: An error occurred while calling o23.isBarrier. 
Trace:\n\u003e py4j.Py4JException: Method isBarrier([]) does not exist\n\u003e         at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)\n\u003e         at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)\n\u003e         at py4j.Gateway.invoke(Gateway.java:274)\n\u003e         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\u003e         at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\u003e         at py4j.GatewayConnection.run(GatewayConnection.java:238)\n\u003e         at java.lang.Thread.run(Thread.java:748)\n\u003e ```\n\nAdd these lines to the ``~/.zshrc`` or ``~/.bash_profile`` file so that PySpark uses python/ipython from the conda environment:\n\n```bash\nexport PYSPARK_PYTHON=\"/anaconda3/envs/inf553/bin/python\"\nexport PYSPARK_DRIVER_PYTHON=\"/anaconda3/envs/inf553/bin/ipython\"\n```\n\nThen apply the changes:\n\n```bash\nsource ~/.zshrc\n```\n\nor\n\n```bash\nsource ~/.bash_profile\n```\n\n\n\nNow run \u003ca name=\"testfile\"\u003e``test.py``\u003c/a\u003e using ``spark-submit test.py`` to see the result. ``test.py`` is shown below:\n\n```python\n# test.py\nfrom pyspark import SparkContext\nsc = SparkContext( 'local', 'test')\nlogFile = \"file:///opt/spark/README.md\"\nlogData = sc.textFile(logFile, 2).cache()\nnumAs = logData.filter(lambda line: 'a' in line).count()\nnumBs = logData.filter(lambda line: 'b' in line).count()\nprint('Lines with a: %s, Lines with b: %s' % (numAs, numBs))\n```\n\nThe output is\n\n```python\nLines with a: 61, Lines with b: 30\n```\n\n\u003e There may be a lot of log output when running ``spark-submit``, as shown: ![loginfo](./Note/pic/loginfo.png)\n\u003e\n\u003e These messages can be hidden. 
\n\u003e\n\u003e Go to the Spark configuration folder:\n\u003e\n\u003e ```bash\n\u003e cd /opt/spark/conf\n\u003e ```\n\u003e\n\u003e Copy the ``log4j.properties.template`` file and edit the copy:\n\u003e\n\u003e ```bash\n\u003e cp log4j.properties.template log4j.properties\n\u003e vim log4j.properties\n\u003e ```\n\u003e\n\u003e On line 19 you will see ``log4j.rootCategory=INFO, console``:\n\u003e\n\u003e ![sparklog](./Note/pic/sparklog.png)\n\u003e\n\u003e Change ``INFO`` to ``WARN`` and save the file. Then it is done!\n\n\n\n\n\n#### Install Scala\n\nDownload Scala 2.11.8 from the [link](https://www.scala-lang.org/download/2.11.8.html), then run:\n\n```bash\nsudo tar -zxf scala-2.11.8.tgz -C /usr/local\ncd /usr/local/\nsudo mv ./scala-2.11.8/ ./scala\n```\n\nAdd this line to ``~/.zshrc`` or ``~/.bash_profile``:\n\n```bash\nexport PATH=\"/usr/local/scala/bin:$PATH\"\n```\n\nNow run Scala in the terminal: ![scala](./Note/pic/scala.png)\n\n\u003e Installing Scala 2.11.12 instead seems to cause problems.\n\n\n\n### Homework Details\n\n|            | Setting                                        | Duration Benchmark (sec)                 | Local Duration (sec)                | Result Benchmark                                  | Local Result                                       |\n| ---------- | ---------------------------------------------- | ---------------------------------------- | ----------------------------------- | ------------------------------------------------- | -------------------------------------------------- |\n| HW2 Task 1 | Case1: Support=4\u003cbr /\u003eCase2: Support=9         | Case1: \u003c=200\u003cbr /\u003eCase2: \u003c=100           | Case1: 7\u003cbr /\u003eCase2: 8              |                                                   |                                                    |\n| HW2 Task 2 | Filter Threshold=20\u003cbr /\u003eSupport=50            | \u003c=500                                    | 14                               
   |                                                   |                                                    |\n| HW3 Task1  | Jaccard similarity                             | \u003c=120                                    | 12                                  | Recall\u003e=0.95\u003cbr /\u003ePrecision=1.0                   | Recall=0.99\u003cbr /\u003ePrecision=1.0                     |\n| HW3 Task2  | Model-Based: rank=3, lambda=0.2, iterations=15 | Model-Based: \u003c=50\u003cbr /\u003eUser-Based: \u003c=180 | Model-Based: 17\u003cbr /\u003eUser-Based: 13 | Model-Based RMSE: 1.30\u003cbr /\u003eUser-Based RMSE: 1.18 | Model-Based RMSE: 1.066\u003cbr /\u003eUser-Based RMSE: 1.09 |\n| HW4 Task1  |                                                | \u003c=500                                    | 210                                 |                                                   |                                                    |\n| HW4 Task2  |                                                | \u003c=500                                    | (Unrecorded)                        |                                                   |                                                    |\n\n\n\n#### HW1\n\n- Task 1: use ``user.json``\n\n- Task 2: use ``user.json``\n- Task 3: use ``review.json`` and ``business.json``\n\n\n\n#### HW2\n\n- Task 1: use A-Priori \u0026 SON algorithm to find all possible frequent itemsets\n  - for ``small2.csv`` case 1 with ``support=4``, local test shows ``minPartition=3``, ``A_priori_short_basket()`` works better. Local test takes 7 seconds. \n  - for ``small1.csv`` case 2 with ``support=9``, local test shows ``minPartition=2``, ``A_priori_long_basket()`` works better. Local test takes 8 seconds. 
\n  - ``A_priori_long_basket()`` optimizes the process of generating itemsets of size $k+1$ from itemsets of size $k$\n  \n- Task 2: \n\n  - collect frequent singletons as well as frequent pairs using brute force (emit all possible singletons/pairs in each basket, then filter using ``support``)\n  - then delete all baskets with ``size=1`` or ``size=2`` (around 16000 such baskets), which speeds up the later steps\n  - use A-Priori only to find candidate itemsets with ``size\u003e=3``\n\n  - use ``A_priori_long_basket()``, local test shows ``minPartition=3`` works better. Local test takes around 17 seconds.\n\n#### HW3\n\n- Task1: min-hash \u0026 LSH to find similar business_id pairs\n\n  - Jaccard similarity: \n\n    - Use ``mapPartitions()`` instead of ``.map()`` for most ``RDD`` operations to speed things up\n\n    - local test shows optimal ``numPartitions=5`` using ``sc.parallelize()`` to load the input file (sometimes ``parallelize()`` is not large enough to load the input file), with \n\n      - pure computation time \u003cu\u003e8 seconds\u003c/u\u003e for the whole process (``load data``$\\to$``min-hash``$\\to$``LSH``$\\to$``compute similarity``$\\to$``write result``), \n\n      - \u003cu\u003escript running time 12.4 seconds\u003c/u\u003e (use ``time spark-submit script.py``, cpu time)\n\n      - \u003cu\u003eprecision=1.0\u003c/u\u003e, \n\n      - \u003cu\u003erecall=0.99\u003c/u\u003e.\n\n    - local ``numPartitions``-``data load method`` experiment results are in the [task1 experiment log file](./Homework/Assignment3/pysrc/experiment_time.txt)  ([experiment script](./Homework/Assignment3/pysrc/task1_local_experiment.py))\n\n    - Question: ``sc.textFile()`` with a customized ``minPartitions`` works similarly to ``sc.parallelize()`` with a customized ``numPartitions``, so what's the difference? 
(not clear after searching on Google)\n\n  - Cosine similarity: optional, not implemented\n\n- Task2: Collaborative filtering\n\n  For details and tips, see the [implementation description file](./INF553_HW3_siqi_liang_description.pdf)\n  \n- Model-based\n  \n- User-based\n  \n  - __Use global average when calculating similarity rather than co-rated item average!!!!!__ (Lower RMSE in this case)\n  \n  \u003e - ``statistics.mean(list)`` is slower than ``sum(list)/len(list)``!!!!!! After replacing ``statistics.mean()`` with ``sum(list)/len(list)``, local user-based test time is around 70s (150s before replacement)\n  \u003e - It seems that if we use user_avg as the prediction for all pairs, RMSE\u003c1.07 on ``yelp_test.csv``__?!?!?!?!?!?!?!?!!?__ \n\n\n\n#### HW4\n\n#### HW5\n\n#### Final Competition\n\n|                    | Submission 1                                                 | Submission 2                                                 | Submission 3                                                 |\n| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| Val RMSE           | 1.019139997                                                  | 1.002121513                                                  | 0.9807033701                                                 |\n| Test RMSE          | 1.015982788                                                  | 1.000238924                                                  | 0.9793494612                                                 |\n| Val Duration       | 16s                                                          | 42s                                                          | 241s                                                         |\n| Method             | Use weighted average rating on users as well as businesses. 
Then combine them together using 1:1 weights again. | Use global average rating on users as well as businesses. Then combine them together using 1:1 weights again. | Both user.json and business.json are used to generate user_features.csv and business_features.csv for the later model. Then use business features 'business_star', 'latitude', 'longitude', 'business_review_cnt', and user features 'user_review_cnt', 'useful', 'cool', 'funny', 'fans', 'user_avg_star' to train the Gradient Boosting model. |\n| Error (\u003e=0 and \u003c1) | 96910                                                        | 97892                                                        | 102013                                                       |\n| Error (\u003e=1 and \u003c2) | 37451                                                        | 36978                                                        | 32998                                                        |\n| Error (\u003e=2 and \u003c3) | 7051                                                         | 6682                                                         | 6229                                                         |\n| Error (\u003e=3 and \u003c4) | 632                                                          | 492                                                          | 804                                                          |\n| Error (\u003e=4)        | 0                                                            | 0                                                            | 0                                                            |\n\n- Top 3 test RMSE in the class:\n  - 0.9750778569\n  - 0.9773295973\n  - 0.9784191539\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentds%2Finf-553","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagentds%2Finf-553","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagentds%2Finf-553/lists"}