Project-1 Resubmission
20th Nov, 2023
Hi,
In today's analysis, I learned about time series models for predicting future data; a short fitting sketch follows the list below.
- SARIMA (Seasonal Autoregressive Integrated Moving Average): This model extends the ARIMA model by incorporating seasonality. SARIMA is useful in datasets where seasonal patterns are prominent, and it includes additional seasonal parameters to account for these patterns.
- VAR (Vector Autoregression): The VAR model is used for multivariate time series data, where the system captures the linear interdependencies among multiple variables. It’s particularly useful when you want to model and predict systems where variables influence each other.
- LSTM (Long Short-Term Memory): LSTM is a type of recurrent neural network (RNN) particularly effective in learning order dependence in sequence prediction problems. This model is well-suited for time series data where there are long-term dependencies or patterns.
- ARIMA (Autoregressive Integrated Moving Average): ARIMA is one of the classic models used for time series forecasting. It combines autoregressive features with moving averages and integrates differencing to make the time series stationary, making it effective for a wide range of datasets.
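To make this concrete, here is a minimal SARIMA fitting sketch using statsmodels. The synthetic monthly series, the (1, 1, 1) x (1, 1, 1, 12) orders, and the 12-month forecast horizon are all illustrative assumptions, not values from our analysis.

```python
# Minimal SARIMA sketch: fit a seasonal model and forecast one year ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with a trend plus yearly seasonality (assumed data).
idx = pd.date_range("2013-01-01", periods=84, freq="MS")
y = pd.Series(
    0.5 * np.arange(84)
    + 10 * np.sin(2 * np.pi * np.arange(84) / 12)
    + np.random.normal(0, 1, 84),
    index=idx,
)

# order=(p, d, q), seasonal_order=(P, D, Q, s); s=12 for monthly seasonality.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
print(result.forecast(steps=12))  # predicted values for the next 12 months
```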
15th Nov, 2023
Today I analyzed economic indicators from an Analyze Boston dataset containing 264 records. This is a legacy dataset of economic indicators tracked monthly between January 2013 and December 2019 by the Boston Planning & Development Agency (BPDA), which is tasked with planning for and guiding inclusive growth within the City of Boston. I also studied an example of time series data.
13th Nov, 2023
In today's class we discussed time series data and forecasting. Time series data refers to a sequence of data points collected or recorded at regular time intervals. This kind of data is prevalent in fields such as economics, finance, and environmental science. The key characteristics of time series data include trends, seasonality, and cyclic patterns.
Sequential Nature: The primary characteristic of time series data is its sequential order. Unlike other types of data, the order in which the data points are arranged is crucial because it reflects the progression of time.
Components of Time Series (a decomposition sketch follows the list):
- Trend: It represents the long-term progression of the series. Trends can be upward, downward, or even sideways over time.
- Seasonality: These are patterns that repeat at regular intervals, such as hourly, daily, weekly, monthly, or annually. For example, increased ice cream sales during summer months.
- Cyclic Patterns: Unlike seasonal patterns, cyclic patterns occur over longer periods and are not fixed to a particular time frame. They are often related to business or economic cycles.
- Irregular or Random Components: These are unforeseeable variations that are not part of the trend, seasonality, or cyclic components. They often result from unforeseen or random events.
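These components can be separated explicitly with a classical decomposition. Here is a small sketch using statsmodels' seasonal_decompose; the synthetic monthly series and the 12-month period are assumptions for illustration.

```python
# Decompose a series into trend, seasonal, and residual (irregular) parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2013-01-01", periods=84, freq="MS")
y = pd.Series(
    0.3 * np.arange(84)
    + 5 * np.sin(2 * np.pi * np.arange(84) / 12)
    + np.random.normal(0, 0.5, 84),
    index=idx,
)

parts = seasonal_decompose(y, model="additive", period=12)
print(parts.trend.dropna().head())   # long-term progression
print(parts.seasonal.head(12))       # repeating 12-month pattern
print(parts.resid.dropna().head())   # irregular/random component
```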
Project Report 2, 12th Nov, 2023
8th Nov, 2023: Decision Trees
In today’s lesson, I learned about decision trees, which are visual models of the decision-making process. Imagine them as a flow of queries and choices culminating in a definitive result. The process begins with an initial query, and with each subsequent answer, you proceed along the pathways until you reach the final outcome.
- Root Node: This is the first node of the tree where the data splits. It represents the entire dataset, which then gets divided into two or more homogeneous sets.
- Splitting: It is the process of dividing a node into two or more sub-nodes based on certain conditions.
- Decision Node: When a sub-node splits into further sub-nodes, it’s called a decision node.
- Leaf/Terminal Node: Nodes that do not split any further are called leaves or terminal nodes. They represent the output or the decision taken after computing all attributes.
- Branches or Sub-Trees: A section of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node, which is divided into sub-nodes, is called the parent node of the sub-nodes, whereas the sub-nodes are the children of the parent node.
We also applied logistic regression to the given data and wrote our own code. A small sketch of the tree terminology above follows.
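Here is a minimal sketch of that terminology, assuming scikit-learn and its built-in iris toy dataset rather than our actual project data.

```python
# Fit a small decision tree and print its structure (root, splits, leaves).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# max_depth=2 keeps the tree readable: a root node, decision nodes, and leaves.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indentation level is a split; lines ending in "class: ..." are leaves.
print(export_text(tree, feature_names=data.feature_names))
```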
1st Nov, 2023
Today we analyzed the Washington Post data and observed a positive correlation between age and "not fleeing" status. This might indicate that older individuals are less likely to flee or resist, but without a specific correlation coefficient, it is hard to quantify the strength of this relationship. There is also a negative correlation between "not fleeing" status and incidents involving mental illness. Even though the correlation is weak, this could suggest that individuals with mental illness might be slightly more likely to flee. However, this connection is not robust.
The strength of these correlations is weak. This means that while there might be some relationship between flee status and other factors, these relationships are not dominant or necessarily predictive. It would be essential to consider other potential confounding variables or factors that could impact these relationships. Additionally, correlation does not imply causation: even if there is a relationship between two variables, one does not necessarily cause the other.
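Here is a hedged sketch of how such correlations could be computed with pandas. The file path and the column names ('age', 'flee', 'signs_of_mental_illness') follow the public Washington Post schema but are assumptions to verify against the actual file.

```python
# Sketch: correlate age and a mental-illness flag with a "did not flee" status.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # illustrative path

# Assumed encodings: flee == "Not fleeing" marks no escape attempt;
# signs_of_mental_illness is a boolean flag.
df["not_flee"] = (df["flee"] == "Not fleeing").astype(int)
df["mental_illness"] = df["signs_of_mental_illness"].astype(int)

# Pearson coefficients quantify the (weak) relationships discussed above.
print(df[["age", "not_flee", "mental_illness"]].corr())
```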
23rd Oct, 2023
In today's class we discussed various clustering methods, specifically comparing and contrasting three prominent ones: K-means, K-medoids, and DBSCAN.
- K-means Clustering:
- K-means is a partitioning method that aims to divide a dataset into K distinct, non-overlapping clusters.
- It is a centroid-based approach, where the data points are assigned to the cluster with the nearest centroid.
- K-means has the drawback of being sensitive to the initial placement of centroids and may not work well with non-globular clusters.
- K-medoids Clustering:
- K-medoids is another partitioning method, similar to K-means, but it uses medoids instead of centroids.
- A medoid is the data point within a cluster that minimizes the dissimilarity to all other points in that cluster. It is more robust to outliers than centroids.
- K-medoids is less sensitive to the initial choice of medoids and works well with non-globular clusters.
- DBSCAN Clustering:
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based method that groups points lying in dense regions and labels points in sparse regions as noise or outliers.
- Unlike K-means and K-medoids, it does not require the number of clusters to be specified in advance and can discover clusters of arbitrary shape.
- K-means and K-medoids are more suitable for datasets with well-defined, globular clusters, while DBSCAN is better at handling clusters of irregular shapes.
- The choice of clustering method often depends on the nature of the data, the desired number of clusters, and the tolerance for outliers; a brief comparison sketch follows the list.
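A small comparison sketch with scikit-learn; K-medoids is omitted here because it lives in the separate scikit-learn-extra package. The two-moons data and the parameter values are illustrative assumptions.

```python
# Compare K-means and DBSCAN on non-globular (two-moons) data.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# K-means cuts the moons with a straight boundary (globular assumption),
# while DBSCAN recovers the two crescent-shaped clusters; -1 marks noise.
print("K-means labels:", set(kmeans_labels))
print("DBSCAN labels:", set(dbscan_labels))
```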
20th Oct, 2023
In today's session we clarified some doubts, and I looked more deeply into the Washington Post data. I noted that a significant portion of the individuals involved in the incidents did not attempt to escape. Those who were armed with firearms frequently chose to flee, either on foot or in vehicles. Conversely, individuals armed with knives tended to remain at the scene, although if they did decide to flee, they typically did so on foot. Unarmed individuals often made efforts to escape, either on foot or by vehicle. Furthermore, those who employed vehicles as weapons often used them as a means of escape from the scene. These patterns can be tabulated directly, as in the sketch below.
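A sketch of how these patterns could be tabulated, again assuming the public dataset's 'armed' and 'flee' columns and an illustrative file path.

```python
# Cross-tabulate weapon type against flee status (column names assumed).
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # illustrative path

# Row-normalized proportions: for each weapon category, how did people flee?
table = pd.crosstab(df["armed"], df["flee"], normalize="index")

# Categories of interest (assumed to be present in the "armed" column).
print(table.loc[["gun", "knife", "unarmed", "vehicle"]])
```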
18th Oct, 2023
In today's class we analyzed the Washington Post data. Specifically, we explored the age distribution of people who have lost their lives in police-related incidents. During our discussion, we formulated several important questions to guide our project or research, such as:
What does the age distribution of individuals killed by the police look like over the years, and are there any noticeable trends or patterns?
Can we identify any potential correlations between age and other variables, like the location of incidents, the time of day, or the circumstances of the encounter with law enforcement?
How is the data suitable for a logistic regression model? What characteristics of the dataset make it conducive for logistic regression analysis?
16th Oct, 2023
Today, we deeply analyzed the Washington Post data repository concerning fatal police shootings in the United States, with a particular focus on the locations where these incidents have occurred. This data repository is a valuable resource that provides comprehensive information on police-involved shootings, offering critical insights into a complex and pressing issue within the country.
We studied and examined the dataset, which includes a wide array of variables, such as the date, location, race of the individuals involved, and various circumstances surrounding each incident.
The location of these incidents plays a significant role in understanding the distribution and potential disparities in police shootings across the country. By plotting this data on maps, we were able to identify regions with higher incidence rates; a minimal plotting sketch follows.
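This sketch assumes the dataset carries 'longitude' and 'latitude' columns and uses an illustrative file path; a proper map projection (e.g. with geopandas) would be the natural next step.

```python
# Quick scatter of incident coordinates as a stand-in for a real map.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # illustrative path
df = df.dropna(subset=["longitude", "latitude"])     # assumed column names

plt.scatter(df["longitude"], df["latitude"], s=2, alpha=0.3)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Fatal police shootings: incident locations")
plt.show()
```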
13th Oct, 2023
In today's class we analyzed the locations of the shootings in the given data, and we had a discussion about the Washington Post's data repository on fatal police shootings in the United States. This data repository is a valuable resource that provides comprehensive information about instances of fatal police shootings across the country. The discussion centered on the content and significance of this repository, focusing on two key aspects: the data on fatal police shootings and the geographic locations of these incidents.
The Washington Post has been meticulously collecting and maintaining data on fatal police shootings in the United States for several years. The data include details about each incident, such as the date, time, location, the individuals involved, and the circumstances surrounding the shooting. This information is essential for gaining insights into the prevalence and patterns of such incidents, including factors like the race, age, and gender of the victims, as well as the types of weapons involved. We also worked on the given data.
Analyzing the Data of The Washington Post, 11th Oct, 2023
Today, we conducted an analysis of the Washington Post data repository on fatal police shootings in the United States. This data source is a comprehensive collection of information related to fatal encounters between law enforcement officers and civilians, which is meticulously tracked and maintained by The Washington Post. Our analysis aimed to gain a deeper understanding of this complex issue and uncover meaningful insights.
The given data was complicated, and we formulated some questions. We aimed to provide a comprehensive analysis of the Washington Post data repository on fatal police shootings in the United States, offering insights that can contribute to a better understanding of this issue and inform discussions and decision-making in the realm of law enforcement and criminal justice reform.
Linear Regression, Project Report 1
6th Oct, 2023
Today was a productive day in our project as we dedicated our time to writing code and implementing various linear regression models. Our project revolves around predictive analytics, and linear regression models play a crucial role in this endeavor.
The day began with a team meeting to discuss our progress and plan our coding tasks. We divided our responsibilities among team members, each focusing on different aspects of the project. Our linear regression models were performing admirably, providing valuable insights and predictions.
Today I also learned in more depth about linear regression models and the p-value.
4th Oct, 2023
In today's class, we focused on our project. We dedicated a significant portion of our time to addressing questions and concerns related to the project. This discussion allowed us to clarify any uncertainties and make sure that everyone was on the same page regarding the project's goals, scope, and objectives.
We began working on the actual implementation by starting to write the code. This marks an important milestone in our project’s development, as it signifies the transition from planning and conceptualization to the hands-on creation of our project.
We also discussed our project further; many queries came up during those discussions, and we clarified them with our professor.
2nd Oct, 2023
Today, I worked on my project findings and continued working diligently on the project, which involves analyzing the CDC 2018 data, and we developed the code.
I also watched videos covering train error and test error in depth.
Train error:
Train error is typically used during the training phase to optimize the model’s parameters to minimize this error. The goal is to make the model perform as well as possible on the training data.
Test Error:
Test error is a critical metric because it provides an estimate of how well the model is likely to perform on real-world, unseen data. The lower the test error, the better the model’s generalization ability.
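A small sketch contrasting the two errors on synthetic data; the data-generating process, the 75/25 split, and the linear model are assumptions for illustration.

```python
# Compute train and test error for a simple linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
```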
29th Sep, 2023
Today, we made significant progress on our project. We dedicated our time to working on our project codebase and delving deeper into the concept of cross-validation.
I also watched videos on cross-validation and learned that cross-validation is a robust technique used to evaluate and validate the performance of machine learning models. It involves splitting the dataset into multiple subsets or "folds" and systematically training and testing the model on different combinations of these folds.
The day was marked by significant progress on our project, refining our code, and acquiring a deep understanding of the crucial concept of cross-validation.
Test Error, 27th September, 2023
In today’s class, we delved into the fascinating world of cross-validation and test error analysis applied to polynomial models using a dataset containing 354 data points related to diabetes.
There are 354 data points for which we have records of all 3 variables (obesity, inactivity, diabetes), and I learned about K-fold cross-validation.
K-fold cross-validation is a widely used technique in machine learning for assessing the performance and generalization ability of a model, especially when you have a limited amount of data. The dataset is divided into K roughly equal-sized subsets or "folds". The standard choice for K is 5 or 10, but you can choose other values based on the size of your dataset and computational resources. I also worked on my project, and we are working on the given datasets.
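A minimal sketch of 5-fold cross-validation across polynomial degrees; the synthetic data below merely stands in for the 354-point diabetes records, and the degrees tried are illustrative.

```python
# Use 5-fold CV to compare polynomial models of increasing degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(354, 1))
y = 2 * X.ravel() ** 2 + rng.normal(0, 0.1, 354)

for degree in (1, 2, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"degree {degree}: mean CV MSE = {-scores.mean():.4f}")
```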
25th September, 2023
In today’s class, I learned about cross-validation, bootstrapping, and k-fold cross-validation.
K-Fold Cross-Validation:
K-fold cross-validation is a crucial technique in machine learning and data analysis. It involves dividing the original dataset into “k” subsets or folds, each of roughly equal size. The model is then trained and evaluated “k” times, with each fold taking turns as the validation set while the remaining folds serve as the training data. This comprehensive process ensures that every data point participates in both training and validation, leading to a more reliable assessment of model performance. By averaging the results from each iteration, K-fold cross-validation provides a robust estimate of how well a model generalizes to unseen data.
Bootstrapping: Bootstrapping is a resampling technique used for statistical inference. It involves creating multiple random samples (with replacement) from a given dataset.
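A minimal bootstrap sketch, assuming a synthetic sample and 1,000 resamples; the percentile interval at the end is one simple way to summarize the resampled statistic.

```python
# Bootstrap the sample mean: resample with replacement, inspect the spread.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # stand-in sample

boot_means = [
    rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)
]

# 2.5th/97.5th percentiles give a simple 95% bootstrap interval for the mean.
print(np.percentile(boot_means, [2.5, 97.5]))
```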
I also watched the videos on the right and wrong ways to do cross-validation and learned a bit more. I worked on my project as well.
22nd Sep, 2023
Today I worked on my project. Initially, I embarked on this project with a basic understanding of machine learning and data analysis. However, as I progressed, I realized that simply training a model on a dataset is not enough to judge how well it will perform, and I learned in depth about the p-value and the t-test.
I learnt about the Cross validation and validation set approach. One crucial component of cross-validation is the validation set. This set is used to assess the model’s performance during training and fine-tuning. It acts as a sort of checkpoint, helping me adjust the model’s hyperparameters and detect any issues early in the development process. By separating a portion of my dataset for validation, I gained a clearer understanding of how well my model was learning from the data.
20th September, 2023
During today's class, we delved into an intriguing aspect of the crab molt model. This area of study revolves around examining the patterns and behaviors associated with molting in crabs, shedding light on their growth and developmental processes.
Crab Molt Model: The crab molt model serves as a conceptual framework utilized by researchers to gain a deeper understanding of the molting process in crabs.
Premolt Data: Premolt data encompasses information and observations gathered from crabs in the period leading up to their molt. During this phase, crabs often undergo noticeable behavioral changes and physiological adjustments.
Postmolt Data: In contrast, postmolt data refers to information collected from crabs immediately after they have molted and are in the process of hardening their new exoskeleton.
T-Test: The t-test emerges as a valuable statistical tool frequently employed in scientific research to determine whether there exists a significant difference between two sets of data. In the context of the crab molt model, t-tests can be applied to assess whether there are statistically significant disparities among various parameters within premolt and postmolt data. I am also still working on the project.
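A hedged two-sample t-test sketch with SciPy; the premolt and postmolt values below are synthetic stand-ins, not the actual crab measurements.

```python
# Two-sample t-test: do premolt and postmolt sizes differ significantly?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
premolt = rng.normal(loc=130.0, scale=10.0, size=50)   # illustrative sizes
postmolt = rng.normal(loc=143.0, scale=10.0, size=50)

t_stat, p_value = stats.ttest_ind(premolt, postmolt)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")  # small p -> significant difference
```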
18th September, 2023
Simple Linear Regression:
Simple linear regression is a statistical technique that models the relationship between a dependent variable (Y) and an independent variable (X). It presumes a linear relationship between them, meaning that changes in X produce proportional changes in Y. The objective is to find the line that minimizes the sum of squared differences between the observed data points and the line's predicted values, typically written as Y = aX + b.
In a simple linear regression, ‘a’ stands for the slope of the line, which reflects how much Y changes for a one-unit change in X, and ‘b’ stands for the intercept, which represents the value of Y when X is zero.
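A minimal least-squares sketch recovering 'a' and 'b' from synthetic data; the true slope and intercept used to generate the data are assumptions for illustration.

```python
# Fit Y = aX + b by least squares with numpy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
Y = 2.5 * X + 1.0 + rng.normal(0, 1, 100)  # true a=2.5, b=1.0 (assumed)

a, b = np.polyfit(X, Y, deg=1)  # slope and intercept minimizing squared error
print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
```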
Multiple Linear Regression
The extension of simple linear regression to incorporate many independent variables is known as multiple linear regression. It models the relationship between a number of independent variables (X1, X2, X3, etc.) and a dependent variable (Y). The equation is written as Y = a1X1 + a2X2 + a3X3 + … + b, where 'a1', 'a2', 'a3', etc., are the coefficients that indicate the influence of each independent variable on the dependent variable, and 'b' is the intercept.
We can examine the combined effects of numerous predictors on the response variable using multiple linear regression. It is commonly used to make predictions and comprehend complex relationships in a variety of disciplines, including economics, finance, and the social sciences.
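A corresponding multiple-regression sketch with scikit-learn, again on synthetic data with assumed true coefficients.

```python
# Fit Y = a1*X1 + a2*X2 + a3*X3 + b and inspect the coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three predictors X1, X2, X3
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
print("coefficients a1..a3:", model.coef_)  # per-predictor influence on Y
print("intercept b:", model.intercept_)
```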
15th Sep, 2023
Before applying what was covered in recent classes to the project, it's essential to understand that material and why it is relevant. I also learned in depth about the p-value and about heteroscedasticity in the diabetes data.
What is a “P” value?
I went to my MTH522 class today, and the topic of discussion was the basic idea of p-values in statistics. The course was interesting and instructive, offering useful insights into the fields of statistical significance and hypothesis testing.
During the class, we delved into the concept of p-values, which are a vital component of statistical analysis. The instructor explained that a p-value is a numerical measure used to assess the strength of evidence against a null hypothesis. Essentially, it quantifies the likelihood of obtaining the observed results if the null hypothesis were indeed true. Overall, the MTH522 class provided a comprehensive and insightful exploration of p-values, equipping attendees like me with a better understanding of this critical statistical concept and its practical relevance in research and data analysis.
Simple Linear Regression
Today’s first MTH522 class delved into the world of statistics with a focus on simple linear regression using real-world data from the CDC’s diabetes graphs. We began by understanding the fundamentals of linear regression, a powerful statistical tool used to model relationships between variables.
I also listened to the discussion of residuals. Residuals are the differences between the observed values and the values predicted by our linear regression model. Recognizing and addressing heteroscedasticity (non-constant residual variance) is part of checking the model's assumptions; see the sketch below.
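A small sketch of a residual plot in which the noise grows with X, producing the classic funnel shape that signals heteroscedasticity (synthetic data, illustrative only).

```python
# Residual plot: a widening funnel of residuals suggests heteroscedasticity.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, 200)
y = 2 * X + rng.normal(0, X)  # noise scale grows with X -> heteroscedastic

a, b = np.polyfit(X, y, deg=1)
residuals = y - (a * X + b)

plt.scatter(X, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()
```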
In summary, today’s MTH522 class provided a comprehensive introduction to simple linear regression, using real data from the CDC’s diabetes graphs to illustrate key concepts.