56:219:522 data processing (data management for data science)
cross listed with 56:824:718 data management and 56:834:650 special problems
sp25 tue 6-8.50 bsb336
https://theaok.github.io/datManPy most current syllabus (updated continuously)(stata version of this class)
datman@googlegroups.com listserv (everyone in class gets
these emails; if you didn't get the welcome email or can't post to the group, email me at adam.okulicz.kozaryn@gmail.com and I will add you!)
- instructor: Adam Okulicz-Kozaryn adam.okulicz.kozaryn@gmail.com
- office: 321 Cooper St, room 302; office hours: Thu 1-2, and by appointment
- or just stop by: this semester I am in most of Tue and Thu
prerequisites
You need to be comfortable using a computer. Knowledge of Python (R,
Stata, etc) and data-management/computer science is helpful but not
necessary. We will cover the basics.
note to social science/humanities students:
This class is mostly coding/programming/scripting.
If you do not like programming, this class is not for you. But you may not yet know
whether you like it and you may start liking it in this class: it often happened before!
course description and learning objectives/outcomes
data (sources, best practices, tips and tricks): this class
is as much about Python as about data (you'll use data you choose
yourself, and it will serve you well beyond this class!!)
hands-on/applied Python/Pandas (there will be some theory, too)
You will learn how to manage your data: clean, organize, manipulate
and automate: eg: data types, text/math functions, recoding,
documentation, merging, reshaping, loops, if/else.
Use AI to write code, but do edit it! Reduce bloat; improve it, play with it, and make sure you understand it!
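A few of the data management steps listed above (data types, text functions, recoding, loops, if/else) can be sketched in Pandas as below. This is only a minimal illustration: the toy data and all column names are made up, and the class notebook may do these steps differently.

```python
import pandas as pd

# Toy data, made up for illustration only
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": [" alice", "BOB ", "Cara"],
    "income": ["50000", "62000", "48000"],  # numbers stored as text
    "region": ["n", "s", "n"],
})

# data types: convert text to numbers
df["income"] = pd.to_numeric(df["income"])

# text functions: strip whitespace and fix capitalization
df["name"] = df["name"].str.strip().str.title()

# recoding: map short codes to readable labels
df["region"] = df["region"].map({"n": "north", "s": "south"})

# loop and if/else: flag incomes at or above 50000
flags = []
for value in df["income"]:
    if value >= 50000:
        flags.append("high")
    else:
        flags.append("low")
df["income_group"] = flags

print(df)
```

The loop is spelled out to show if/else; in practice a vectorized comparison on the column is usually shorter.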
required textbooks and materials
No required textbooks. All required materials (code, readings) will be provided.
software
Python >=3.10
(python.org). Can
download for free for Linux, Win, Mac. We will use several libs,
mostly Pandas.
BUT no need to download or install anything: we will run
Python online (in a web browser, in the cloud), in so-called "colab" (2
sections down). But first GitHub.
GitHub
We will use GitHub to store Python code (.py) in form of a
notebook (.ipynb), and we will edit (and run) the notebook in colab (next sec).
sign up or login at github.com
(depending on os/browser) on top left hit "New" or "Create
Repository", or top right under plus "+" select "New repository"
pick some repository name, say "datman"; keep 'Public' selected; important!: under "Initialize this repository
with" check "Add a README file"; and hit "Create repository" at the bottom
then hit "Settings" towards the middle-top right; on the left select
"Collaborators" tab and hit "Add people" : "theaok", and hit "Add theaok to this repository"
workflow: my comments, diffs, inline response [let's go over this again next week]
I will run your notebook in my Colab, edit it, and upload it back
diff and response to my comments: cleaner and better in
colab: File-Revision history; clunkier in GitHub:
click my commit message to see the so-called
diff (the difference between your version and my version): important!
do make sure to fix things up for the next ps; you may even respond
inline to my comments in your next ps (especially if something is complex
or if you disagree)
don't forget a meaningful commit message; you can keep
uploading newer versions as many times as you like
note: when you click the file, you can then click 'History' and
see how the file evolved over time :)
file naming: ps1.ipynb, ps2.ipynb, etc, or
ps1, ps2, etc sections in one file; or just one file that you keep updating throughout with new stuff as we go!
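To get a feel for what a diff shows, here is a small sketch using Python's standard difflib module. The two "versions" are made-up lists of code lines, not real class files, and this is not how Colab or GitHub compute their diffs internally; it only shows the idea of lines added between your version and mine.

```python
import difflib

# Two made-up versions of the same code cell (illustration only)
yours = [
    "import pandas as pd",
    "df = pd.read_csv('data.csv')",
    "df.describe()",
]
mine = [
    "import pandas as pd",
    "df = pd.read_csv('data.csv')",
    "df = df.dropna()",
    "df.describe()",
]

# unified_diff marks added lines with "+" and removed lines with "-"
diff = list(difflib.unified_diff(
    yours, mine,
    fromfile="your version", tofile="my version",
    lineterm="",
))
print("\n".join(diff))
```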
colab
Just run the Py notebook in Colab and save subsequent versions in
GitHub, which will keep track of changes [stick with this for the ps]
go
to https://github.com/theaok/datManPy/blob/main/pandas.ipynb
and hit 'open in colab'
OR go
to colab.research.google.com
and on popup pick GitHub, search for:
https://github.com/theaok/datManPy/blob/main/pandas.ipynb
(it should find it and load it into colab, and
follow instructions at the top of the file, ie save it in your
GitHub etc)
best projects:
https://colab.research.google.com/github/ewattudo/datamanagement/blob/main/PS5.ipynb
https://colab.research.google.com/github/Jonchyk/Datamgmt/blob/main/PS5_Vis_Grouping.ipynb
data
The class is a bit like an independent study: you will carry out some
very basic research. You do need your own data for this class ASAP: the more data and the more
complex, the better. Your code will need to load the data straight
from online. Some data can be loaded directly from online,
eg https://gss.norc.org/get-the-data/stata.
But much cannot. Then you have to put the data online yourself [just go over Git<25mb]:
https://theaok.github.io/generic/howToPutDataOnline.html
https://www.libraries.rutgers.edu/subject-librarians?keyword=&division=All&unit=All&specialization=351
icpsr: biggest repository of survey data; check out also var search
google is great for data search; and it has a dataset search, too
google cloud/BigQuery has data, too
kdnuggets
listing of sources; kdnuggets great in general for data
science; maybe start here, easier to wrap your head around
another kdnuggets listing
yet another one: maybe esp FiveThirtyEight and Reddit
kaggle
NOAA
NASA
datasets on GitHub
datahub
pew
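Loading data straight from online, as required above, can look roughly like the sketch below. The URL in the comment is a placeholder (hypothetical, replace with your own), and the helper name load_csv is made up for illustration. To keep the example runnable without internet, it reads a tiny made-up CSV from memory; pd.read_csv accepts a URL in exactly the same place.

```python
import io
import pandas as pd

def load_csv(source):
    """Read a CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# With data online, pass the URL directly (placeholder URL, an assumption):
# df = load_csv("https://raw.githubusercontent.com/<user>/<repo>/main/data.csv")
# Stata files work the same way with pd.read_stata(url).

# Offline stand-in: a tiny made-up CSV held in memory
sample = io.StringIO("state,pop\nNJ,9.3\nPA,13.0\n")
df = load_csv(sample)
print(df.shape)
print(df.head())
```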
grading
2 keys to success: start early AND ask many questions often (and study groups: get a couple of people on zoom, screenshare notebooks, etc). This is a
software class. It is different from typical soc sci classes! You will get
stuck often, and whenever stuck, email the listserv, ask me, ask your
classmates, as opposed to pulling
your hair out! And stop by my office, too. Googling (and built-in Gemini) solves most
problems, but for many things it's better to talk to me and your
classmates; it's also more social/human: if you talk to a computer all the
time, it's not healthy.
100% (5ps x 20%) problem sets [just Py notebook]; you may cowrite code (up to 2 people) but then
the project should be 2 times better than a single-authored one
bonus/extra up to 5% engagement, class participation
eg answering/asking questions, helping others, listserv
discussions
bonus/extra up to 5% civic engagement (see bottom of the syllabus)
calendar
[*] = bonus (extra/not required)
sp25: i have family emergency, probably after jan28 2 classes on zoom
ps0.pdf
pandas.ipynb
see some vids, can see screen with good resolution for coding steps:)
intro.pdf
replication.pdf
!!zoom only!! jan28 I/O (Input/Output) and basic descriptive statistics
vid
old vid
ps1.pdf
find_data.pdf
data.pdf
pandas.ipynb
!!zoom only!! feb4 manipulate data
vid
old vid
ps2.pdf
note: added #3 to ps1: 'do some manipulations such as subset/slice on condition, filter vars or obs using regexp, and groupby/agg'
let's start with diffs in colab: File-Revision History: uncheck show output:
https://colab.research.google.com/github/worldterminator/worldterminator/blob/main/ps0.ipynb
and https://colab.research.google.com/github/nhs47/DatPro/blob/main/ps0_Nabiha.ipynb
early/bonus/volunteer: present/go through ps1, esp descriptive statistics and interpretations
manipulate.pdf
pandas.ipynb
manipulate; and dive into merge (1st basic example)
pandas.ipynb: merge
flip the class, work on ps2: (I walk around and sit with each of you; Q and A; otherwise I look at your colabs, and then approach you with ideas)
if time do real world examples from next class
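The manipulations named in the ps1 note for this class (subset/slice on condition, filter vars or obs using regexp, and groupby/agg) can be sketched as below. The data and column names are made up for illustration; your own data will need its own conditions and patterns.

```python
import pandas as pd

# Toy data, made up for illustration only
df = pd.DataFrame({
    "county": ["Camden", "Burlington", "Camden", "Gloucester"],
    "inc_2020": [50, 70, 55, 65],
    "inc_2021": [52, 72, 57, 66],
    "pop": [10, 20, 12, 15],
})

# subset/slice on condition: keep rows with pop above 11
big = df[df["pop"] > 11]

# filter vars (columns) using a regular expression
inc = df.filter(regex=r"^inc_")

# filter obs (rows) using a regular expression on a text column
c_rows = df[df["county"].str.contains(r"^C", regex=True)]

# groupby/agg: one row per county with named aggregations
by_county = df.groupby("county").agg(
    mean_inc=("inc_2021", "mean"),
    total_pop=("pop", "sum"),
)
print(by_county)
```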
feb18 real world examples and plotly
vid
old vid
old vid
ps3.pdf
Q and A, and go over your ps1 and ideally ps2
chetan, diff in colab, File-Revision history, uncheck Show output
real world data management (eg mapping/recoding urbanicity) example (covid city paper)
datasets of the week: usda ers; irs soi county-to-county; nj ag use of force data
real world merge example: Eric; and another example, Xiao
https://colab.research.google.com/github/theaok/vis/blob/main/plotly.ipynb
merge is typically a necessary initial step,
but usually the final step is to explore the new relationships
[*] pandas fancy stuff and other fancy stuff (also focus on your projects, discuss, brainstorm, flip the class)
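On the point above that merge is an initial step and exploring the new relationships is the final one, here is a minimal sketch. The two tables, the key name county, and the values are all made up; indicator=True is a real Pandas option that adds a _merge column showing where each row came from.

```python
import pandas as pd

# Two made-up tables sharing a key (illustration only)
happy = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "happiness": [7.1, 6.4, 6.9, 7.4, 6.0],
})
income = pd.DataFrame({
    "county": ["A", "B", "C", "D", "F"],
    "income": [50, 42, 47, 58, 61],
})

# initial step: merge, and investigate what matched and what did not
merged = happy.merge(income, on="county", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# final step: explore the new relationship on the matched rows
both = merged[merged["_merge"] == "both"]
print(both[["happiness", "income"]].corr())
```

Rows marked left_only or right_only are worth a look before moving on: they often reveal spelling or coding differences in the key.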
feb25 wrap up pandas and pandas extra topics (slow down and flexible: you choose
what to learn)
vid
old vid
old vid
old vid
ps4.pdf
go over ps2, go over merge investigation/interpretation again!,
QandA on merge, flip the class work on ps3
wrap up all of pandas
extra topics
mar4 ps3 presentations; and profiler, imputations, fuzzywuzzy
vid
old vid
present ps3: 10min sharp + 10min discussion; focus
on interesting stuff like research question, data, variables, relationships:
descriptive stats and visualizations; skip boring stuff like
subsetting and renaming
profiler, imputations, fuzzywuzzy
mar11 theory and flip the class work on ps4
vid
old vid
theory.pdf
sai and chetan present (also see their ind stu, how different
they are?); and go over listserv ps3 comments
flip the class work on ps4: slow down, focus on your projects, redo/improve/polish
mar18 no class sp break
mar25 ps4 presentations
vid
old vid
revisit theory from last class
presentation: no need for slides, just the notebook: 15min
sharp + 15min discussion/q and a
ps5.pdf
factor analysis
theory.pdf: 2nd sec: CS stuff
chetan presentation
apr8 review, q and a vid
Srija presentation
go through code from earlier, focus on what's underused or needs
more elaboration, like: missing obs/duplicates (eg profiler),
groupby agg, recode/map, merge, imputations, apis/fred
check out my python notebooks for research; you should use the data you produced in this class to write a paper: publish or perish:
pop growth and happiness pdf
and colab
covid and happiness
pdf
and colab
flip much of the class work on ps5/final project
shell and AI ideas
Eric joins via livestream https://rcit.rutgers.edu/av-request/live/08699-1-2025
and via zoom
https://pwa.zoom.us/wc?mn=8892839953&pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09
if time: i will pull up your latest and go over it
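Among the review topics listed for this session, missing obs, duplicates, and imputations can be sketched as below. The data are made up, and median imputation is just one simple choice picked for illustration, not necessarily the method covered in class.

```python
import pandas as pd

# Toy data, made up for illustration only
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "age": [34.0, 40.0, 40.0, None, 29.0],
    "city": ["Camden", "Trenton", "Trenton", "Newark", None],
})

# missing obs: count missing values per column
missing = df.isna().sum()
print(missing)

# duplicates: count fully duplicated rows, then drop them
n_dupes = df.duplicated().sum()
clean = df.drop_duplicates().copy()

# imputation: fill missing age with the median of observed ages
clean["age"] = clean["age"].fillna(clean["age"].median())
print(clean)
```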
apr22 wrap up, summarize
vid
wrap.pdf
ad: http://theaok.github.io/swb
revisit theory
Srija present sentiment analysis
pull up instructive chunks of code from: ???
i fork couple best repos as example for future classes
15min sharp + 15min discussion/q and a
just to be safe, delete the data you have posted online, you never know: someone may be picky about it
rules
do not share or link to class videos!
These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.
attendance
Attendance is required: if you miss class without a documented emergency you will lose participation credit; either way do let me know ahead of time; we can put you on zoom so you can participate. Be advised that you are
responsible for any material covered in the class, whether or not it was in the readings or
lecture notes. You are also responsible for any announcements made in class. For most
students, attendance is simply essential to learning the material. If you do need to miss a
class, be sure to consult with a fellow student to learn what transpired.
incompletes: Generally speaking, the material in this course is best learned as a single unit. I
will grant incompletes only in cases where a substantial change in life circumstances occurs that
is beyond the control of the student, and only with appropriate
documentation.
study groups. You are encouraged to form a regular study group. Many students over the years
have found the study groups to be very helpful. Study groups are permitted and encouraged to
work on the problem sets together. However, each individual student should write up his or her
own answer to hand in, based on his or her own understanding of the material. Do not hand in a
copy of another person’s problem set, even a member of your own group. Writing up your own
answer helps you to internalize the group discussions and is a crucial step in the learning process.
Academic Integrity. I am very serious about this. Make no
mistake--I may appear accommodating and informal--but I am extremely
strict about academic integrity. Violations of academic integrity include cheating on tests or handing in
assignments that do not reflect your own work and/or the work of a study group in which you
actively participated. Handing in your own work that was performed not
for this class (e.g. other class, any other project) is cheating,
too. I have a policy of zero tolerance for cheating. Violations will be referred
to the appropriate university authorities.
For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy
Accommodating Students with Disabilities.
Any student with a disability affecting performance in the class
should contact the disability office ASAP:
https://success.camden.rutgers.edu/success-services/disability-services/
civic engagement component (opportunity for extra credit!)
Start early. Start thinking about how you want to engage civically
today.
Universities and social science should serve society.
You are encouraged to engage with the local community.
The idea is that you engage civically using research methods. There are several
ways to do it. Ideally, you will partner with a local organization,
obtain data from them, do some analysis, and present results to them. You may also use government data, say from census bureau, and present relevant
information to locals. A local organization can be Rutgers research
institute such as WRI, CURE, LEAP or any other organization such as
school or soup kitchen or CamConnect. Rutgers Office of civic
engagement may be able to help
you contact them. The key idea is partnership: you will use tools
from this class to produce output useful to local community. This
is similar to taking a role of an apprentice at a local organization
or serving as a consultant.
Using real world data poses challenges, which is part of the
exercise. Presenting your findings to stakeholders outside of a class
is also challenging. At the same time, it is fairly easy to contribute
locally by using simple tools learned in this class. For instance,
simple comparison of means between two schools in Camden can be
revealing and helpful locally.
An obvious way would be to use data at your workplace or at a
workplace of someone you know. However, you need to make sure that it
serves society in some way. For instance, it would be straightforward
if you work at a hospital or school or fire department; but it would
be difficult if you work at Starbucks.