https://github.com/michaelkinfu/etd-topic-modeling

# etd-topic-modeling
The electronic theses and dissertations (ETD) topic modeling project was conducted by the Chinese University of Hong Kong Library (CUHK Library).
This digital scholarship project is intended strictly for nonprofit and academic purposes.

## Research Cycle
- Data Collection: The records were cataloged in previous years. Our team extracted them from Alma using simple queries.
- Data Processing: Using the thesis titles and the subject headings assigned by catalogers, our team selected the titles related to Hong Kong.
- Topic Modeling: Based on sentence embeddings, discover similar titles.
- Clustering: Divide the titles into five clusters using the k-means algorithm.
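
The data processing and topic modeling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record fields `title` and `subjects` are hypothetical (the real Alma export schema is not described here), and a bag-of-words count vector stands in for a real sentence embedding so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter

# Hypothetical records; the field names "title" and "subjects" are assumptions,
# not the actual Alma export schema.
records = [
    {"title": "Housing policy in Hong Kong",
     "subjects": ["Housing -- China -- Hong Kong"]},
    {"title": "Public housing and urban policy",
     "subjects": ["Housing policy -- China -- Hong Kong"]},
    {"title": "Protein folding in yeast",
     "subjects": ["Proteins"]},
]


def is_hong_kong_related(record):
    """Keep a record if its title or any subject heading mentions Hong Kong."""
    text = " ".join([record["title"], *record["subjects"]]).lower()
    return "hong kong" in text


def embed(title):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", title.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


hk_titles = [r["title"] for r in records if is_hong_kong_related(r)]
similarity = cosine_similarity(embed(hk_titles[0]), embed(hk_titles[1]))
```

With a real sentence-embedding model, `embed` would return a dense vector instead, and the same cosine comparison would apply.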

## Observation
### Trends in Hong Kong-related Theses
Based on the extracted data, our team found a total of 20,423 theses in the ETD collection, of which 3,878 were related to Hong Kong studies. This means that approximately 19% of all postgraduate theses, or roughly one in five, are associated with the theme of Hong Kong. Based on the dataset, our team made the following findings:

- The total number of theses showed an increasing trend over time. (Figure 2)
- The proportion of Hong Kong-related theses among all theses decreased over time. (Figure 1)
- A change point was identified in the year 1995. (Figure 2)
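
The overall share quoted above follows directly from the two totals. The sketch below reproduces that arithmetic and shows one way a per-year proportion (as plotted in Figure 1) could be computed. The helper `yearly_share` and the yearly counts passed to it are hypothetical and serve only to illustrate the calculation; they are not the project's data.

```python
TOTAL_THESES = 20_423  # total theses in the ETD collection (from the text above)
HK_THESES = 3_878      # theses related to Hong Kong studies (from the text above)

overall_share = HK_THESES / TOTAL_THESES
print(f"{overall_share:.1%}")  # prints 19.0%


def yearly_share(hk_by_year, total_by_year):
    """Proportion of Hong Kong-related theses for each year with at least one thesis."""
    return {
        year: hk_by_year.get(year, 0) / total
        for year, total in sorted(total_by_year.items())
        if total
    }


# Hypothetical counts, for illustration only.
example = yearly_share({1990: 10, 2000: 12}, {1990: 40, 2000: 120})
```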

In conclusion, the gap between the number of Hong Kong-related theses and the total number of theses has widened. This indicates a significant divergence in research topics, with less emphasis on Hong Kong-related subjects after 1995. Our team hypothesizes that two factors may explain this:

- There has been a significant increase in the number of research theses across various disciplines in recent years.
- The CUHK Business School has shown less interest in Hong Kong-related fields.
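
This README does not state how the 1995 change point was identified. As one possible illustration only, the sketch below finds a single shift in the mean of a series by choosing the split that minimizes the total squared error of the two segments. The function name `mean_shift_change_point` and the series are made up for the example; the values are synthetic and are not the project's data.

```python
def mean_shift_change_point(values):
    """Return the index that best splits values into two segments with different means."""

    def sse(segment):
        # Sum of squared deviations from the segment mean.
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Synthetic yearly shares, NOT the project's data.
years = list(range(2000, 2008))
shares = [0.30, 0.31, 0.29, 0.30, 0.15, 0.14, 0.16, 0.15]
change_year = years[mean_shift_change_point(shares)]
```

On a real series with an underlying trend, a dedicated change-point method would likely be a better fit than this single mean-shift search.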

#### Figure 1 - Proportion of Hong Kong-Related Theses
![alt text](/image/statistic1.png)
#### Figure 2 - Comparison of the Trends
![alt text](/image/statistic2.png)

### Five Clusters
Using k-means clustering, the 3,878 Hong Kong-related titles were grouped into five distinct clusters ([Figure 3 - HTML](/image/demo.html)):

- Marketing and Business: 1,115 titles
- Cultural and Political: 792 titles
- Urbanization and Land Use: 788 titles
- Population and Chinese: 656 titles
- School and Education: 527 titles
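
To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on toy 2D points. It is an illustration under stated assumptions, not the project's implementation: the project credits BERTopic and clustered title embeddings into five groups, whereas this example uses `k=2`, made-up points, and a hypothetical `kmeans` helper. The final lines check that the cluster sizes listed above add up to the 3,878 Hong Kong-related theses.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return (labels, centroids) for equal-length vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy 2D points standing in for title embeddings (made up for illustration).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)

# Cluster sizes reported above; together they cover all Hong Kong-related theses.
cluster_sizes = {
    "Marketing and Business": 1115,
    "Cultural and Political": 792,
    "Urbanization and Land Use": 788,
    "Population and Chinese": 656,
    "School and Education": 527,
}
total_clustered = sum(cluster_sizes.values())
```

In practice a library implementation with better initialization (for example k-means++) is usually preferable to a hand-written loop like this one.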

#### Figure 3 - Five Clusters of ETD Collection
![alt text](/image/clustering.png)

## Acknowledgement
CUHK Digital Repository (ETD Collection): https://repository.lib.cuhk.edu.hk/en/collection/etd

BERTopic website: https://maartengr.github.io/BERTopic/index.html