https://github.com/achint08/tech-diffusion

Patents data analysis on PySpark
bert big-data bigquery google-patents-dataset machine-learning nlp pagera patents-analysis pyspark

Host: GitHub
URL: https://github.com/achint08/tech-diffusion
Owner: Achint08
Created: 2022-04-25T07:23:44.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-05-28T00:32:58.000Z (about 4 years ago)
Last Synced: 2025-10-04T00:31:18.682Z (8 months ago)
Topics: bert, big-data, bigquery, google-patents-dataset, machine-learning, nlp, pagera, patents-analysis, pyspark
Language: Jupyter Notebook
Homepage:
Size: 610 KB
Stars: 2
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

# patents-analysis

This project analyzes technology diffusion using big data on PySpark.

In simple terms, we analyze how companies and organizations depend on each other, using a patents dataset and its citation records.

The module consists of the following:

1. General analysis of the dataset, such as the top 10 most-cited companies and the top 10 patent-producing companies.
2. Graph analysis: PageRank and strongly connected components.
3. Text analysis of patent abstracts: LDA model.
4. Machine learning models to predict whether company A will cite company B's patent: BERT, Naive Bayes, Multilayer Perceptron, Random Forest.
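
As a rough illustration of the graph analysis step, here is a minimal plain-Python sketch of PageRank by power iteration on a tiny company citation graph. The edge list, the company names, the damping factor of 0.85 and the iteration count are all illustrative assumptions; the notebooks run on PySpark and may compute PageRank differently.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each node splits its rank equally among the nodes it links to.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical company-level edges: (citing company, cited company).
citations = [("A", "B"), ("C", "B"), ("B", "D"), ("A", "D"), ("C", "A")]
ranks = pagerank(citations)
for company, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{company}: {score:.3f}")
```

In this toy graph, a company ranks highly when it is cited by companies that are themselves well cited, which is the intuition behind using PageRank to find innovation hubs.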

Note: we analyze only the top 100 companies for each year.
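
The per-year restriction could be sketched as below. This is a plain-Python illustration, not the project's PySpark code. It assumes companies are ranked by the number of patents they hold in a given year, which the README does not state, and the function name `top_companies_per_year` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def top_companies_per_year(patents, n):
    """Return {year: [company, ...]} with the n companies holding the most patents that year."""
    counts = defaultdict(Counter)
    for year, company in patents:
        counts[year][company] += 1
    return {
        year: [company for company, _ in counter.most_common(n)]
        for year, counter in counts.items()
    }


# Hypothetical (year, company) records, one per patent.
patents = [
    (2001, "A"), (2001, "A"), (2001, "B"), (2001, "C"), (2001, "B"), (2001, "A"),
    (2002, "C"), (2002, "C"), (2002, "B"),
]
top = top_companies_per_year(patents, n=2)  # the project uses n=100
print(top)
```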

## What is Technology Diffusion?

* Technology diffusion is the way innovation spreads among organizations through certain channels over time.
* Because patents are an important medium of invention, citations give useful insight into how technology is disseminated.
* It captures the relation between the innovation happening in different organizations and how they depend on each other.

## Dataset & Infrastructure

The dataset used is the Google Patents dataset.

* Platform - GCP
* 1 master, 3 worker nodes
* standard-m1 instance
* 30 GB Memory
* 8 CPU cores
* Dataset size – 20 million rows (patents from 1976)
* Tables
  * Patent
  * Patent citation
  * Assignee
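
To show how these tables relate, here is a minimal plain-Python sketch that turns patent-level citations into company-to-company citation counts. The row layout, the patent IDs and the function name `company_citation_counts` are hypothetical and do not reflect the dataset's real schema. Dropping citations inside the same company is also an assumption made here for illustration.

```python
from collections import Counter

# Hypothetical rows; the real tables have more columns and different names.
assignee = {"P1": "A", "P2": "B", "P3": "B", "P4": "C"}  # patent id -> company
patent_citation = [  # (citing patent, cited patent)
    ("P1", "P2"),
    ("P1", "P3"),
    ("P4", "P1"),
    ("P2", "P3"),
]


def company_citation_counts(citation_rows, patent_to_company):
    """Aggregate patent-level citations into (citing company, cited company) counts."""
    counts = Counter()
    for citing, cited in citation_rows:
        src = patent_to_company.get(citing)
        dst = patent_to_company.get(cited)
        # Skip citations with an unknown assignee and citations within one company.
        if src is None or dst is None or src == dst:
            continue
        counts[(src, dst)] += 1
    return counts


edges = company_citation_counts(patent_citation, assignee)
print(edges)
```

The resulting weighted edges are the kind of input a company-level graph analysis would need.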

## Predictions

* Naïve Bayes – Accuracy about 57%
* Multilayer Perceptron – Accuracy about 98%
* Decision Tree – Accuracy about 93%
* Random Forest – Accuracy about 95%
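
For reference, accuracy here means the share of company pairs whose label (cites or does not cite) is predicted correctly. The sketch below computes it in plain Python on hypothetical labels, alongside a majority-class baseline, which is a common sanity check when one label is much more frequent than the other. Neither the labels nor the baseline come from the project's notebooks.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical labels: 1 = company A cites company B, 0 = it does not.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1]
print(accuracy(y_true, y_pred), majority_baseline(y_true))
```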

## Other Contributors

- Simron Waskar
- Sumit Dhundiyal
- Zexu Li

## Conclusion

* Technology diffusion exists and has been increasing year over year.
* IBM and Samsung have been the top innovation hubs in the last decade.
* For the top 100 companies, we can predict whether one organization will cite another, with accuracy that varies by model.
* The topics extracted from the documents show that technology trends are strongly reflected in the abstracts and titles.

## Thank you. :)
