https://github.com/achint08/tech-diffusion
Patents data analysis on PySpark
- Host: GitHub
- URL: https://github.com/achint08/tech-diffusion
- Owner: Achint08
- Created: 2022-04-25T07:23:44.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-05-28T00:32:58.000Z (about 4 years ago)
- Last Synced: 2025-10-04T00:31:18.682Z (8 months ago)
- Topics: bert, big-data, bigquery, google-patents-dataset, machine-learning, nlp, pagera, patents-analysis, pyspark
- Language: Jupyter Notebook
- Homepage:
- Size: 610 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
# patents-analysis
This project analyzes technology diffusion using big data on PySpark.
In simple terms, we analyze how companies and organizations depend on each other, using a patents dataset and its citation records.
The module consists of the following:
1. General analysis of the dataset, such as the top 10 most-cited companies and the top 10 patent-producing companies.
2. Graph analysis - PageRank, strongly connected components.
3. Text analysis of patent abstracts - LDA model.
4. Machine learning model to predict whether company A will cite company B's patent - BERT, Naive Bayes, Multilayer Perceptron, Random Forest.
Note - We only analyze the top 100 companies for each year.
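
As a rough illustration of the graph-analysis step, the sketch below runs a power-iteration PageRank over company-level citation pairs in plain Python. It is not the project's PySpark code: the `pagerank` function, the `toy_edges` data, and the damping factor of 0.85 are all assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) company pairs."""
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {name: set() for name in nodes}
    for citing, cited in edges:
        out_links[citing].add(cited)

    n = len(nodes)
    rank = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Rank held by companies that cite nobody is spread evenly over all nodes.
        dangling = sum(rank[name] for name in nodes if not out_links[name])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {name: base for name in nodes}
        for citing, targets in out_links.items():
            if not targets:
                continue
            # Each citing company splits its rank equally among the companies it cites.
            share = damping * rank[citing] / len(targets)
            for cited in targets:
                new_rank[cited] += share
        rank = new_rank
    return rank


# Hypothetical toy data: (citing company, cited company).
toy_edges = [("A", "B"), ("C", "B"), ("B", "D"), ("A", "D"), ("D", "B")]
ranks = pagerank(toy_edges)
for company in sorted(ranks, key=ranks.get, reverse=True):
    print(company, round(ranks[company], 3))
```

A company that is cited by many others, or by highly ranked others, ends up with a higher score, which is why PageRank is a natural fit for measuring influence in a citation graph.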
## What is Technology Diffusion?
* The process by which innovation spreads among organizations through certain channels over time.
* Citations provide useful insight into technology dissemination, as patents are an important medium of invention.
* The relation between the innovation happening in organizations and how they are interdependent.
## Dataset & Infrastructure
The dataset used is the Google Patents dataset.
* Platform - GCP
* 1 master, 3 worker nodes
* standard-m1 instance
* 30 GB Memory
* 8 CPU cores
* Dataset size – 20 million rows (patents from 1976)
* Tables:
  * Patent
  * Patent citation
  * Assignee
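
To show how the three tables can combine into company-to-company citations, here is a minimal plain-Python sketch. The field names (`patent_id`, `assignee_id`, `citing_id`, `cited_id`), the sample rows, and both helper functions are hypothetical and do not reflect the actual Google Patents schema or the project's PySpark code.

```python
from collections import Counter

# Hypothetical miniature versions of the Patent, Assignee and Patent citation tables.
patents = [
    {"patent_id": "P1", "assignee_id": "a1"},
    {"patent_id": "P2", "assignee_id": "a1"},
    {"patent_id": "P3", "assignee_id": "a2"},
    {"patent_id": "P4", "assignee_id": "a3"},
]
assignees = {"a1": "Company A", "a2": "Company B", "a3": "Company C"}
citations = [
    {"citing_id": "P3", "cited_id": "P1"},
    {"citing_id": "P4", "cited_id": "P1"},
    {"citing_id": "P4", "cited_id": "P2"},
    {"citing_id": "P4", "cited_id": "P3"},
    {"citing_id": "P2", "cited_id": "P9"},  # P9 is not in the patent table
]


def company_citation_edges(patents, assignees, citations):
    """Map each patent-level citation to a (citing company, cited company) pair."""
    owner = {p["patent_id"]: assignees[p["assignee_id"]] for p in patents}
    edges = []
    for row in citations:
        citing = owner.get(row["citing_id"])
        cited = owner.get(row["cited_id"])
        # Drop citations whose patents have no known owner.
        if citing is None or cited is None:
            continue
        edges.append((citing, cited))
    return edges


def top_cited(edges, k=10):
    """Return the k companies that receive the most citations."""
    return Counter(cited for _, cited in edges).most_common(k)


edges = company_citation_edges(patents, assignees, citations)
print(top_cited(edges, k=10))
```

This sketch keeps self-citations (a company citing its own patents) if they occur; whether to filter them out is a modelling choice that the sketch leaves open.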
## Predictions
* Naïve Bayes – Accuracy about 57%
* Multilayer Perceptron – Accuracy about 98%
* Decision Tree – Accuracy about 93%
* Random Forest – Accuracy about 95%
## Other Contributors
- Simron Waskar
- Sumit Dhundiyal
- Zexu Li
## Conclusion
* Technology diffusion exists and has been increasing year by year.
* IBM and Samsung have been the top innovation hubs over the last decade.
* For the top 100 companies, we can predict whether one organization will cite another's patent, with accuracy ranging from about 57% to about 98% depending on the model.
* The topics extracted from the documents show that technology trends are strongly reflected in the abstracts and titles.
## Thank you. :)